I would like to predict if high scoring determines if you are on a good team or not. A lot of times there are players who score a lot but aren’t on a good team. For many people including fans and scouts it will determine a lot about the player if he is not a good team or not. If that player is on a good team chances are that player has a great upside and potential. I will determine if high scoring means your are on a good team or not. I will do this through my second data source that stores the top 25 ranked teams in the nation and will use data from my first deliverable.

Here is my second data source which includes the ranked teams. I obtained this data source via web scraping. I used gsub functions to take out unwanted data so they can match and have the same format from my first dataset.After that I merged the two datasets by school.

rankings_web <- read_html("https://www.ncaa.com/rankings/basketball-men/d1/associated-press")

rank <- rankings_web %>%
  html_nodes("tbody") %>%
  html_nodes("tr")

Rank <- rank %>%
  html_nodes("td:first_child") %>%
  html_text() %>%
  as.integer()
Team <- rank %>%
  html_nodes("td:nth_child(2)") %>%
  html_text()
Points <- rank %>%
  html_nodes("td:nth_child(4)") %>%
  html_text()
Record <- rank %>%
  html_nodes("td:nth_child(3)") %>%
  html_text()

Rankings <- cbind.data.frame(Rank=Rank, Team=Team, Points=Points, Record=Record)

Rankings$Team <- gsub(" \\(55\\)", "", Rankings$Team)
Rankings$Team <- gsub(" \\(4\\)", "", Rankings$Team)
Rankings$Team <- gsub(" \\(9\\)", "", Rankings$Team)
Rankings$Team <- gsub(" \\(5\\)", "", Rankings$Team)


colnames(Rankings)[colnames(Rankings)=="Team"] <- "School"
colnames(Rankings)[colnames(Rankings)=="Points"] <- "Team_Points"
colnames(Rankings)[colnames(Rankings)=="Rank"] <- "Team_Rank"
colnames(Rankings)[colnames(Rankings)=="Record"] <- "Team_Record"

Rankings$Team_Points <- gsub(",", "", Rankings$Team_Points)
Rankings$Team_Points <- as.double(Rankings$Team_Points)


Top25 <- merge(x=NCAAData, y=Rankings, by="School", all.x=TRUE)

Printing out NCAAData Table.

(Top25)

Printing out Rankings Table.

(Rankings)

To get a good idea here is a table of the Ranked teams and their total scoring I used from my second dataset.

ggplot(data=Rankings, aes(x=Points)) +
  geom_jitter(aes(y=School, color=School))+
  labs(title="Top 25 Team Ranked Scoring", x="", y="") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Here I tried to use Points and other columns as an indicator to predict if high scoring determines how good the team is but I did not yield good results. However Rebounds gave me a good indicator to predict that good teams depend on rebounding. In order to use points as a good predictor I would need more paramaters or data. After using multiple variables this model did not prove to be useful becuase my R-value is much too small to make a prediction.

set.seed(385)
top <- filter(Top25, !is.na(Top25$Team_Points))
sample_selection <- top$Team_Rank %>%
  createDataPartition(p=0.75, list=FALSE)
train <- top[sample_selection, ]
test <- top[-sample_selection, ]
train_model <- lm(Team_Rank ~ Minutes_Played + Points + Total_Rebounds + Assists + Steals + Field_Goal_Average, data=top)
summary(train_model)
## 
## Call:
## lm(formula = Team_Rank ~ Minutes_Played + Points + Total_Rebounds + 
##     Assists + Steals + Field_Goal_Average, data = top)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3368  -5.6694  -0.3223   5.7816  12.9615 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        15.1294833  1.2050011  12.556  < 2e-16 ***
## Minutes_Played      0.0003695  0.0002407   1.535  0.12517    
## Points             -0.1811269  0.0751599  -2.410  0.01617 *  
## Total_Rebounds     -0.1765421  0.0657867  -2.684  0.00743 ** 
## Assists            -0.2357444  0.1304335  -1.807  0.07105 .  
## Steals             -0.5375429  0.2963778  -1.814  0.07007 .  
## Field_Goal_Average  0.2182486  0.1011466   2.158  0.03123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.729 on 854 degrees of freedom
## Multiple R-squared:  0.02598,    Adjusted R-squared:  0.01914 
## F-statistic: 3.797 on 6 and 854 DF,  p-value: 0.0009684
prediction <- train_model %>% predict(test)
R2(prediction, test$Team_Rank)
## [1] 0.01696989

Here I found that that Win Shares per 40 minutes (the entire game time + if they win) is a good indicator to predict that good teams have decent win shares. It was the best predictor I could find although the R value is still too small to make a good prediction.

train_model <- lm(Team_Rank ~ Total_Rebounds + Win_Shares_per40_Minutes, data=top)
summary(train_model)
## 
## Call:
## lm(formula = Team_Rank ~ Total_Rebounds + Win_Shares_per40_Minutes, 
##     data = top)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.3642  -5.4257  -0.5292   5.7096  12.6978 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               14.53983    0.62013  23.447  < 2e-16 ***
## Total_Rebounds            -0.03405    0.05764  -0.591    0.555    
## Win_Shares_per40_Minutes -19.35524    3.58854  -5.394 8.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.663 on 858 degrees of freedom
## Multiple R-squared:  0.04069,    Adjusted R-squared:  0.03846 
## F-statistic:  18.2 on 2 and 858 DF,  p-value: 1.819e-08
prediction <- train_model %>% predict(test)
R2(prediction, test$Team_Rank)
## [1] 0.02895749